R Package Identification for GitHub Repositories

In this notebook, we load the data collected from the GitHub API v3 (see the "GitHub - Repositories from API" notebook) and check, for every repository, whether it has a DESCRIPTION file at its root. This is the criterion we use to identify the repositories that store an R package.


In [1]:
import gzip
import json
import requests
import pandas

from collections import OrderedDict

In [2]:
INPUT_FILENAME = '../data/R-apiv3-2015-01-01T00:00:00-2015-06-01T00:00:00.tar.gz'
OUTPUT_FILENAME = '../data/RPackage-Repositories-150101-150601.csv' 

# Attributes that are kept in the output.
# A dot in an attribute triggers a nested lookup,
# e.g. 'owner.login' reads item['owner']['login'].
ATTR = [
    'full_name', 
    'name',
    'owner.login',
    'owner.type',
    'created_at', 
    'description', 
    'forks_count',
    'stargazers_count',
    'watchers_count',
    'has_downloads',
    'has_pages', 
    'has_issues', 
    'has_wiki',
]

We make use of IPython's parallel computing support.

To use this notebook, you need either to configure your IPController or to start a cluster of IPython engines, for example with ipcluster start -n 4. See https://ipython.org/ipython-doc/dev/parallel/parallel_process.html for more information.

Recent versions of IPython Notebook can apparently also start a cluster directly from the web interface, under the Cluster tab.


In [3]:
from IPython import parallel

# Connect to the running cluster and make calls non-blocking by default
clients = parallel.Client()
clients.block = False
print 'Clients:', str(clients.ids)


Clients: [0, 1, 2, 3]

We first load the data and identify the distinct repositories that were gathered from the GitHub API v3.


In [4]:
with gzip.GzipFile(filename=INPUT_FILENAME) as gf:
    content = gf.read()

In [5]:
content = json.loads(content)

In [6]:
print '{} items were retrieved from GitHub API v3'.format(len(content))


36880 items were retrieved from GitHub API v3
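The loading step can be tried in isolation on a small file. The following is a minimal sketch, assuming the input is a gzip-compressed JSON document, as the loading cells imply. The helper `load_gzipped_json` and the sample record are hypothetical and are not part of the notebook.

```python
import gzip
import json
import os
import tempfile


def load_gzipped_json(path):
    """Read a gzip-compressed JSON document and return the decoded object."""
    with gzip.open(path, 'rb') as gf:
        return json.loads(gf.read().decode('utf-8'))


# Hypothetical record standing in for one item of the API dump.
sample = [{'full_name': 'alice/pkg', 'name': 'pkg', 'owner': {'login': 'alice'}}]

# Write the sample to a temporary gzip file, read it back, then clean up.
fd, sample_path = tempfile.mkstemp(suffix='.json.gz')
os.close(fd)
try:
    with gzip.open(sample_path, 'wb') as gf:
        gf.write(json.dumps(sample).encode('utf-8'))
    loaded = load_gzipped_json(sample_path)
finally:
    os.remove(sample_path)
```

Reading and decoding in one function keeps the compressed bytes from lingering in a notebook variable once the data has been parsed.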

In [7]:
distinct = {(r['name'], r['owner']['login'], r['full_name']): r for r in content}
print '{} distinct items inside'.format(len(distinct))


32562 distinct items inside
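The effect of keying a dictionary on (name, owner login, full name) can be illustrated on a few made-up records. This is a sketch of the same idea as the cell above, written as an explicit loop. The function name `deduplicate` and the sample data are hypothetical.

```python
def deduplicate(repositories):
    """Keep one record per (name, owner login, full_name); the last one seen wins."""
    by_key = {}
    for repo in repositories:
        key = (repo['name'], repo['owner']['login'], repo['full_name'])
        by_key[key] = repo
    return by_key


# Made-up records: 'alice/pkg' appears twice, 'bob/pkg' once.
sample = [
    {'name': 'pkg', 'owner': {'login': 'alice'}, 'full_name': 'alice/pkg', 'forks_count': 1},
    {'name': 'pkg', 'owner': {'login': 'bob'}, 'full_name': 'bob/pkg', 'forks_count': 0},
    {'name': 'pkg', 'owner': {'login': 'alice'}, 'full_name': 'alice/pkg', 'forks_count': 2},
]
unique = deduplicate(sample)
```

Two repositories with the same name but different owners stay separate, while a repeated item collapses into the record that was seen last.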

We now filter the items to keep only the attributes listed in ATTR.


In [8]:
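def filter_attributes(item, attributes):
    """Return an OrderedDict holding only the requested attributes of item.

    An attribute of the form 'a.b' is read from item['a']['b'] and stored
    under the key 'a.b'. Only one level of nesting is supported.
    """
    new_item = OrderedDict()
    for attr in attributes:
        if '.' in attr:
            attr1, attr2 = attr.split('.')
            new_item['{}.{}'.format(attr1, attr2)] = item[attr1][attr2]
        else:
            new_item[attr] = item[attr]
    return new_item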

In [9]:
items = [filter_attributes(r, ATTR) for r in distinct.values()]
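The nested lookup performed by filter_attributes handles exactly one dot. As an illustration only, here is a sketch of a variant that follows any number of dots. The names `nested_get` and `filter_attributes_any_depth`, as well as the sample record, are hypothetical and do not appear in the notebook.

```python
from collections import OrderedDict
from functools import reduce


def nested_get(item, dotted_key):
    """Follow each dot-separated part of dotted_key through nested dicts."""
    return reduce(lambda value, part: value[part], dotted_key.split('.'), item)


def filter_attributes_any_depth(item, attributes):
    """Keep only the requested attributes, preserving their order."""
    return OrderedDict((attr, nested_get(item, attr)) for attr in attributes)


# Made-up record shaped like a repository item from the API.
sample = {
    'full_name': 'alice/pkg',
    'forks_count': 3,
    'owner': {'login': 'alice', 'type': 'User'},
}
filtered = filter_attributes_any_depth(sample, ['full_name', 'owner.login', 'owner.type'])
```

The flattened keys keep their dotted form ('owner.login'), so they can be used directly as column names later on.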

We are now ready to check whether each repository has a DESCRIPTION file at its root.


In [10]:
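def check_item(item):
    # Probe the raw DESCRIPTION file at the root of the master branch.
    url = 'https://raw.githubusercontent.com/{}/master/DESCRIPTION'.format(item['full_name'])
    response = requests.get(url)
    # HTTP 200 means the file exists: flag the repository as a package.
    if response.status_code == 200:
        item['package'] = 1
    else:
        item['package'] = 0
    return item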


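print len(items), 'items to check'
# Make requests importable on every engine, then spread the checks over them
clients[:].execute('import requests')
balanced = clients.load_balanced_view()
res = balanced.map(check_item, items, ordered=False, timeout=15)

# Poll every 5 seconds and print the number of completed tasks
import time
while not res.ready():
    time.sleep(5)
    print res.progress, ' ',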


32562 items to check
617   731   843   942   1050   1160   1266   1368   1492   1601   1700   1807   1918   2035   2150   2258   2351   2463   2579   2691   2730   2797   2908   2995   3116   3238   3352   3454   3560   3673   3788   3909   4020   4138   4226   4342   4454   4571   4685   4786   4899   5015   5124   5237   5341   5462   5573   5687   5807   5921   6039   6148   6258   6378   6500   6617   6718   6818   6933   7046   7159   7279   7377   7453   7570   7687   7805   7929   8049   8136   8200   8316   8430   8544   8632   8723   8844   8960   9074   9191   9298   9402   9515   9635   9746   9847   9947   10044   10154   10270   10380   10485   10606   10710   10826   10939   11044   11129   11226   11320   11412   11504   11599   11690   11730   11785   11887   11995   12114   12211   12322   12435   12551   12663   12765   12870   12980   13093   13209   13315   13410   13517   13628   13743   13851   13968   14069   14184   14304   14422   14538   14645   14736   14837   14944   15059   15173   15275   15366   15463   15577   15692   15798   15915   16021   16120   16238   16347   16450   16565   16665   16757   16840   16890   16973   17064   17155   17247   17359   17474   17572   17684   17776   17865   17969   18085   18195   18313   18426   18537   18652   18763   18877   18990   19105   19216   19331   19449   19566   19681   19795   19909   20024   20136   20252   20364   20455   20565   20676   20789   20904   21016   21127   21239   21355   21460   21573   21676   21778   21895   22008   22114   22217   22313   22427   22537   22649   22759   22857   22927   23042   23153   23263   23374   23455   23568   23680   23796   23905   24012   24102   24210   24322   24425   24527   24630   24744   24829   24937   25047   25146   25232   25334   25442   25550   25654   25766   25865   25927   26029   26135   26244   26359   26469   26584   26701   26816   26927   27044   27151   27257   27370   27482   27588   27697   27794   27903   28009   28120   28228   28344   
28454   28568   28686   28790   28886   28948   28959   28978   28993   29035   29150   29246   29361   29477   29593   29706   29788   29901   30009   30120   30238   30336   30449   30544   30654   30758   30858   30968   31084   31198   31309   31423   31534   31646   31759   31863   31967   32079   32194   32304   32414   32519   32562  
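The check above only probes the master branch, so a repository whose default branch has another name would be reported as not being a package. The following is a sketch of a variant, not the method used in this notebook. It assumes each item carries the `default_branch` field of the GitHub repository object (which is not in ATTR and would have to be added) and it takes the HTTP getter as a parameter so that it can be exercised without network access. `description_url`, `has_description`, `FakeResponse`, `fake_get` and `broken_get` are hypothetical names.

```python
def description_url(full_name, branch='master'):
    """Build the raw URL of the DESCRIPTION file at the root of a branch."""
    return 'https://raw.githubusercontent.com/{}/{}/DESCRIPTION'.format(full_name, branch)


def has_description(item, get, timeout=10):
    """Return True or False, or None when the request itself failed.

    `get` is any callable with the calling convention of requests.get.
    """
    # Assumption: 'default_branch' was kept in the item; fall back to master.
    branch = item.get('default_branch') or 'master'
    try:
        response = get(description_url(item['full_name'], branch), timeout=timeout)
    except Exception:
        return None  # undetermined: network error, timeout, ...
    return response.status_code == 200


class FakeResponse(object):
    """Stand-in for a response object; only status_code is needed."""
    def __init__(self, status_code):
        self.status_code = status_code


def fake_get(url, timeout=None):
    # Pretend that only alice/pkg has a DESCRIPTION file, on its 'main' branch.
    return FakeResponse(200 if url.endswith('/alice/pkg/main/DESCRIPTION') else 404)


def broken_get(url, timeout=None):
    raise IOError('simulated network failure')


found = has_description({'full_name': 'alice/pkg', 'default_branch': 'main'}, fake_get)
missing = has_description({'full_name': 'bob/notes'}, fake_get)
unknown = has_description({'full_name': 'alice/pkg'}, broken_get)
```

With the real library, `requests.get` would be passed as `get`. Returning None on failure keeps "could not be checked" distinct from "has no DESCRIPTION file".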

In [22]:
# Skip results that are remote errors; keep the first element of each remaining one
results = [r[0] for r in res._result if not isinstance(r, parallel.error.RemoteError)]

df = pandas.DataFrame(results).query('package == 1')[ATTR].set_index('full_name')
df.to_csv(OUTPUT_FILENAME, encoding='utf-8')

In [23]:
print len(df), 'packages found'


6532 packages found